[ctrmgrd] Fix the container restart during config reload #19528

Merged
merged 1 commit into from
Jul 23, 2024

Conversation

lolyu

@lolyu lolyu commented Jul 10, 2024

Why I did it

Fix the config reload failure caused by the container (swss/teamd) restart that ctrmgrd introduces:

Jun 21 11:16:32.844005 xxx NOTICE swss#orchagent: :- syncd_apply_view: Notify syncd APPLY_VIEW
Jun 21 11:16:32.844005 xxx NOTICE swss#orchagent: :- notifySyncd: sending syncd: APPLY_VIEW
Jun 21 11:16:32.845219 xxx WARNING syncd#syncd: :- processNotifySyncd: syncd received APPLY VIEW, will translate
Jun 21 11:16:32.846684 xxx NOTICE syncd#syncd: :- dump: getting took 0.000931 sec
Jun 21 11:16:32.846908 xxx NOTICE syncd#syncd: :- getAsicView: ASIC_STATE switch count: 0:
Jun 21 11:16:32.847129 xxx NOTICE syncd#syncd: :- getAsicView: get asic view from ASIC_STATE took 0.002006 sec
Jun 21 11:16:32.850585 xxx NOTICE syncd#syncd: :- dump: getting took 0.001547 sec
Jun 21 11:16:32.850998 xxx NOTICE syncd#syncd: :- getAsicView: TEMP_ASIC_STATE switch count: 1:
Jun 21 11:16:32.851208 xxx NOTICE syncd#syncd: :- getAsicView: oid:0x21000000000000: objects count: 47
Jun 21 11:16:32.851400 xxx NOTICE syncd#syncd: :- getAsicView: get asic view from TEMP_ASIC_STATE took 0.003422 sec
Jun 21 11:16:32.851586 xxx ERR syncd#syncd: :- applyView: current view switches: 0 != temporary view switches: 1, FATAL
Jun 21 11:16:32.852132 xxx NOTICE syncd#syncd: :- applyView: apply took 0.006951 sec
Jun 21 11:16:32.853039 xxx ERR swss#orchagent: :- syncd_apply_view: Failed to notify syncd APPLY_VIEW -1

The APPLY_VIEW failure occurs because swss is restarted twice during config reload.
The second restart is introduced by ctrmgrd and should be avoided.

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>

Work item tracking
  • Microsoft ADO (number only): 28546618

How I did it

A current owner of "none" implies that the container is stopped; in that case ctrmgrd should not restart it.
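
The rule above can be sketched roughly as follows. This is an illustration only, not the actual ctrmgrd code: the helper name `should_restart` and its parameters `set_owner` and `current_owner` are hypothetical, and the real handler may consider more state than the two owner values.

```python
def should_restart(set_owner: str, current_owner: str) -> bool:
    """Hypothetical helper: decide whether the manager should restart a container.

    set_owner     -- the owner the configuration asks for (assumed name)
    current_owner -- the owner currently recorded for the container (assumed name)
    """
    # A current owner of "none" means the container is stopped, for example
    # in the middle of a config reload. Restarting it here would cause a
    # second, unwanted restart, so leave it to the normal start path.
    if current_owner == "none":
        return False
    # Otherwise restart only when the running owner differs from the desired one.
    return current_owner != set_owner


if __name__ == "__main__":
    print(should_restart("local", "none"))   # stopped container: no restart
    print(should_restart("local", "kube"))   # owner mismatch: restart
    print(should_restart("local", "local"))  # already as desired: no restart
```

The point of the sketch is the early return: a stopped container is treated as "nothing to do" rather than as an owner mismatch.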

How to verify it

Verified with unit tests and on a testbed.

lolv@941a1adedec7:/sonic/src/sonic-ctrmgrd$ pytest -v
============================================================================================================================ test session starts =============================================================================================================================
platform linux -- Python 3.11.2, pytest-7.2.1, pluggy-1.0.0+repack -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /sonic/src/sonic-ctrmgrd, configfile: pytest.ini
plugins: pyfakefs-5.5.0, cov-4.0.0
collected 18 items

tests/container_startup_test.py::TestContainerStartup::test_start PASSED                                                                                                                                                                                               [  5%]
tests/container_test.py::TestContainer::test_start PASSED                                                                                                                                                                                                              [ 11%]
tests/container_test.py::TestContainer::test_stop_ct PASSED                                                                                                                                                                                                            [ 16%]
tests/container_test.py::TestContainer::test_kill PASSED                                                                                                                                                                                                               [ 22%]
tests/container_test.py::TestContainer::test_invalid_kill PASSED                                                                                                                                                                                                       [ 27%]
tests/container_test.py::TestContainer::test_wait PASSED                                                                                                                                                                                                               [ 33%]
tests/container_test.py::TestContainer::test_main PASSED                                                                                                                                                                                                               [ 38%]
tests/ctrmgr_iptables_test.py::TestIPTableUpdate::test_table PASSED                                                                                                                                                                                                    [ 44%]
tests/ctrmgr_tools_test.py::TestCtrmgrTools::test_tools PASSED                                                                                                                                                                                                         [ 50%]
tests/ctrmgrd_test.py::TestContainerStartup::test_server PASSED                                                                                                                                                                                                        [ 55%]
tests/ctrmgrd_test.py::TestContainerStartup::test_feature PASSED                                                                                                                                                                                                       [ 61%]
tests/ctrmgrd_test.py::TestContainerStartup::test_labels PASSED                                                                                                                                                                                                        [ 66%]
tests/kube_commands_test.py::TestKubeCommands::test_read_labels PASSED                                                                                                                                                                                                 [ 72%]
tests/kube_commands_test.py::TestKubeCommands::test_write_labels PASSED                                                                                                                                                                                                [ 77%]
tests/kube_commands_test.py::TestKubeCommands::test_join PASSED                                                                                                                                                                                                        [ 83%]
tests/kube_commands_test.py::TestKubeCommands::test_reset PASSED                                                                                                                                                                                                       [ 88%]
tests/kube_commands_test.py::TestKubeCommands::test_tag_latest PASSED                                                                                                                                                                                                  [ 94%]
tests/kube_commands_test.py::TestKubeCommands::test_clean_image PASSED                                                                                                                                                                                                 [100%]

---------- coverage: platform linux, python 3.11.2-final-0 -----------
Name                          Stmts   Miss  Cover
-------------------------------------------------
ctrmgr/__init__.py                0      0   100%
ctrmgr/container                234     10    96%
ctrmgr/container_startup.py     146     21    86%
ctrmgr/ctrmgr_iptables.py        88      7    92%
ctrmgr/ctrmgr_tools.py           99      1    99%
ctrmgr/ctrmgrd.py               376     36    90%
ctrmgr/kube_commands.py         370     44    88%
-------------------------------------------------
TOTAL                          1313    119    91%
Coverage HTML written to dir htmlcov
Coverage XML written to file coverage.xml


============================================================================================================================ 18 passed in 12.83s =============================================================================================================================
lolv@941a1adedec7:/sonic/src/sonic-ctrmgrd$

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Longxiang Lyu <lolv@microsoft.com>

@zjswhhh zjswhhh left a comment

Is there a way to repro the issue on lab device?

Have we tested the change with an attempt to repro?

lolyu commented Jul 10, 2024

> Is there a way to repro the issue on lab device?
>
> Have we tested the change with an attempt to repro?

It's hard to reproduce manually. What we can do is apply the fix and run the nightly tests to see the outcome.

@lixiaoyuner lixiaoyuner left a comment

LGTM

@zjswhhh zjswhhh left a comment

lgtm

@yxieca yxieca merged commit 1efb520 into sonic-net:master Jul 23, 2024
21 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 23, 2024
@mssonicbld

Cherry-pick PR to 202405: #19655

mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Jul 23, 2024
@mssonicbld

Cherry-pick PR to 202311: #19656

mssonicbld pushed a commit that referenced this pull request Jul 23, 2024
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
mssonicbld pushed a commit that referenced this pull request Jul 23, 2024
Signed-off-by: Longxiang Lyu <lolv@microsoft.com>
arun1355492 pushed a commit to arun1355492/sonic-buildimage that referenced this pull request Jul 26, 2024
liushilongbuaa pushed a commit to liushilongbuaa/sonic-buildimage that referenced this pull request Aug 1, 2024